Overview

These exercises will help you practice applying separate, unite, and regular expressions. You will use a messy dataset with information about cardiovascular disease (CVD).

Pre-requisites

Before starting these exercises, you should have a good understanding of

  1. The Tidy your data Primer.

  2. Chapters 12.3-12.7 and Chapter 14 of R for Data Science.

Setup

knitr::opts_chunk$set(echo = TRUE, message = FALSE)
library(magrittr)

Data dictionary

cvd_messy_descr <-
  c("ID" = 'Participant identification',
    "question_age" = "Question: how old are you / when were you born? Participants 51 or older answered the second question.",
    "question_substance" = 'Question: do you smoke or drink?',
    "question_bp" = 'Question: what is your blood pressure? Are you taking medications to lower your blood pressure?',
    "labs" = 'A collection of laboratory values concatenated into a single string. Notably, the order of lab values is random',
    "cvd_fup" = 'Report of whether this participant experienced a cardiovascular disease event (i.e., stroke or coronary heart disease) after their interview')

# the enframe function transforms a named vector into a two-column tibble (name, value)
tibble::enframe(cvd_messy_descr) %>% 
  gt::gt(rowname_col = "name") %>%
  gt::tab_stubhead(label = 'Variable name') %>% 
  gt::cols_label(value = 'Variable description') %>% 
  gt::cols_align('left') %>% 
  gt::tab_header(title = 'Description of messy cardiovascular disease data')
Description of messy cardiovascular disease data

Variable name        Variable description
ID                   Participant identification
question_age         Question: how old are you / when were you born? Participants 51 or older answered the second question.
question_substance   Question: do you smoke or drink?
question_bp          Question: what is your blood pressure? Are you taking medications to lower your blood pressure?
labs                 A collection of laboratory values concatenated into a single string. Notably, the order of lab values is random
cvd_fup              Report of whether this participant experienced a cardiovascular disease event (i.e., stroke or coronary heart disease) after their interview

Import

cvd_messy <- readr::read_rds('data/cvd_messy.rds')

cvd_messy

Problem 1

Tidy the data up to create the following columns:

  • ID: (numeric) participant identification
  • cvd_status: (numeric) 0 if no CVD, 1 if CVD
  • cvd_time: (numeric) years from interview to CVD or loss to follow-up
  • sbp: (numeric) systolic blood pressure, mm Hg
  • dbp: (numeric) diastolic blood pressure, mm Hg
  • bp_meds: (factor) Yes/No for use of blood pressure lowering medication
  • age_number: (numeric) age in years
  • drink: (factor) Yes/No for drinking
  • smoke: (factor) Yes/No for smoking
  • albumin: (numeric) albumin levels
  • hba1c: (numeric) HbA1C levels
  • creatinine: (numeric) creatinine levels

Each column can be cleaned in a number of different ways.

  • to create cvd_time, I recommend making a slight modification to the regular expression we used in the lecture.

  • A lot of the other variables can be managed with str_detect, str_extract, and str_remove.

  • You could also consider converting some variables into factors with new labels and then using separate.

  • For the lab values, look up a new function: ?separate_rows. The problem will be much harder if you do not use separate_rows.
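The hints above can be sketched on toy data. The string formats below are invented for illustration and will not match the columns of cvd_messy exactly, so treat this as a pattern to adapt rather than a solution. The names toy, bp_clean, and labs_clean are hypothetical.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Toy data: the formats of `bp` and `labs` are assumptions made for this sketch
toy <- tibble::tibble(
  id   = c(1, 2),
  bp   = c("120/80 mm Hg, on meds: yes", "135/85 mm Hg, on meds: no"),
  labs = c("albumin=4.1; hba1c=5.6", "hba1c=6.9; albumin=3.8")
)

bp_clean <- toy %>%
  mutate(
    # str_detect() returns TRUE/FALSE, which maps naturally onto Yes/No
    bp_meds = if_else(str_detect(bp, "meds: yes"), "Yes", "No"),
    # str_extract() returns the first match: the digits that start the string
    sbp = as.numeric(str_extract(bp, "^\\d+")),
    # str_remove() strips the "120/" prefix, then the leading digits are the DBP
    dbp = bp %>% str_remove("^\\d+/") %>% str_extract("^\\d+") %>% as.numeric()
  ) %>%
  select(id, sbp, dbp, bp_meds)

labs_clean <- toy %>%
  select(id, labs) %>%
  # one row per lab value, so the order inside the string no longer matters
  separate_rows(labs, sep = "; ") %>%
  # split "name=value" into two columns; convert = TRUE makes value numeric
  separate(labs, into = c("lab", "value"), sep = "=", convert = TRUE) %>%
  # spread the lab names back out into one column per lab
  pivot_wider(names_from = lab, values_from = value)
```

Because separate_rows puts every lab value on its own row before the names are read, both participants end up with the same albumin and hba1c columns even though their strings list the labs in a different order.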

Once you are finished, remove the original messy columns and convert any character-valued columns to factors. Your cleaned data should look like this:

readr::read_rds('solutions/01_solution.rds')

Problem 2

Create new columns:

  • diabetes (factor) Yes if HbA1C is greater than 6.5, No if less than or equal to 6.5

  • albuminuria (factor) ‘Yes’ if albumin / creatinine is greater than or equal to 30 and ‘No’ if albumin / creatinine is less than 30

  • bp_midrange (factor) Yes if at least one of the two conditions below is true:

    • SBP is greater than or equal to 130 and less than 140
    • DBP is greater than or equal to 80 and less than 90
  • rec_bpmeds_acc_aha (factor) ‘Yes’ if any of the conditions below are TRUE, and ‘No’ if all of them are FALSE.

    • SBP is greater than or equal to 140 OR DBP is greater than or equal to 90
    • bp_midrange == ‘Yes’ and albuminuria == ‘Yes’
    • bp_midrange == ‘Yes’ and diabetes == ‘Yes’
    • bp_midrange == ‘Yes’ and age_number > 65
  • rec_bpmeds_jnc7 (factor) ‘Yes’ if SBP is greater than or equal to 140 OR DBP is greater than or equal to 90, and ‘No’ if SBP is less than 140 and DBP is less than 90.

Note: rec_bpmeds_acc_aha is a simplified version of the 2017 American College of Cardiology and American Heart Association’s BP guidelines.
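As a minimal sketch of how one of these threshold-based factors might be built, the example below derives bp_midrange on toy values. The tibble toy and the helper columns sbp_mid and dbp_mid are hypothetical names used only here. The same if_else pattern should carry over to the other columns.

```r
library(dplyr)

# Toy values chosen to hit each branch, including a missing reading
toy <- tibble::tibble(
  sbp = c(118, 132, 145, NA),
  dbp = c(76, 78, 92, NA)
)

toy_flags <- toy %>%
  mutate(
    # half-open intervals [130, 140) and [80, 90), written out explicitly
    # because dplyr::between() is inclusive on both ends
    sbp_mid = sbp >= 130 & sbp < 140,
    dbp_mid = dbp >= 80 & dbp < 90,
    # if_else() propagates NA, so a missing reading stays missing
    bp_midrange = factor(if_else(sbp_mid | dbp_mid, "Yes", "No"),
                         levels = c("No", "Yes"))
  )
```

When several conditions are combined with |, keep in mind that R evaluates TRUE | NA as TRUE but FALSE | NA as NA, so a participant can receive a definite ‘Yes’ even when one input is missing, while a ‘No’ requires every input to be observed.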

readr::read_rds('solutions/02_solution.rds')

Problem 3

Use count, mutate, glue, and pivot_wider to make the following table summarizing counts and percent of diabetes, stratified by recommendations to initiate or intensify BP lowering. Remember to group and ungroup the data appropriately.

readr::read_rds('solutions/03_solution.rds')
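The count, mutate, glue, and pivot_wider sequence named in this problem can be sketched on toy data as follows. The column name rec and the "n (pct%)" label format are assumptions for illustration; the layout of the solution table may differ.

```r
library(dplyr)
library(tidyr)
library(glue)

# Toy data: `rec` stands in for a recommendation column
toy <- tibble::tibble(
  rec      = c("No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
  diabetes = c("No", "No", "Yes", "No", "Yes", "Yes", "Yes")
)

toy_table <- toy %>%
  # one row per combination, with the count in column n
  count(rec, diabetes) %>%
  # group so that sum(n) is the total within each recommendation stratum
  group_by(rec) %>%
  mutate(pct   = round(100 * n / sum(n)),
         label = as.character(glue("{n} ({pct}%)"))) %>%
  # ungroup so later steps are not silently applied per group
  ungroup() %>%
  select(-n, -pct) %>%
  # one column per recommendation level, one row per diabetes level
  pivot_wider(names_from = rec, values_from = label)
```

Dropping n and pct before pivot_wider matters: any extra column left in the data is treated as an identifier, which would split each diabetes level across several rows.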

Problem 4

You might imagine doing Problem 3 for all variables and then dealing with combining results into a participant characteristics table. Sounds pretty tedious, right? The gtsummary package is here for you. Explore the package website and focus on the tbl_summary() vignette. When you are ready, try using tbl_summary() on the data you created.

Before creating your table, make sure that all of the character variables in your data are converted to factor variables, and that all of your factor variables are given an explicit NA coding such that missing values are given a value of ‘Unknown’.
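One possible way to do this preparation is sketched below on toy data. It assumes forcats 1.0.0 or later, where fct_na_value_to_level() is available (older versions offered fct_explicit_na() for the same purpose). The tibble toy is hypothetical, and the tbl_summary() call only runs if gtsummary is installed.

```r
library(dplyr)
library(forcats)

# Toy data with one character column and one factor column, each with an NA
toy <- tibble::tibble(
  smoke = c("Yes", NA, "No"),
  drink = factor(c("No", "Yes", NA))
)

toy_explicit <- toy %>%
  # step 1: every character column becomes a factor
  mutate(across(where(is.character), as.factor)) %>%
  # step 2: NA becomes a real level called "Unknown" in every factor
  mutate(across(where(is.factor),
                ~ fct_na_value_to_level(.x, level = "Unknown")))

# With explicit levels in place, the table can be stratified by a factor
if (requireNamespace("gtsummary", quietly = TRUE)) {
  tbl <- gtsummary::tbl_summary(toy_explicit, by = drink)
}
```

The two mutate calls are kept separate so that the second one is certain to see the factors created by the first.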

readr::read_rds('solutions/04_solution.rds')
Characteristic                    Recommended to initiate or intensify medications to lower BP by the 2017 ACC/AHA guidelines
                                  No, N = 6082¹       Yes, N = 2993¹      Unknown, N = 925¹
Recommended initiation / intensification by JNC7
  No                              6082 (100%)         974 (33%)           805 (87%)
  Yes                             0 (0%)              2019 (67%)          0 (0%)
  Unknown                         0 (0%)              0 (0%)              120 (13%)
Age, years                        51 (43, 61)         63 (54, 70)         54 (46, 61)
Systolic blood pressure, mm Hg    118 (111, 125)      144 (136, 152)      131 (126, 136)
  Unknown                         0                   0                   120
Diastolic blood pressure, mm Hg   73 (68, 78)         80 (73, 87)         82 (77, 84)
  Unknown                         0                   0                   120
Systolic/diastolic BP 130-140/80-90 mm Hg
  No                              4907 (81%)          1245 (42%)          54 (5.8%)
  Yes                             1175 (19%)          1748 (58%)          751 (81%)
  Unknown                         0 (0%)              0 (0%)              120 (13%)
Currently using BP lowering medication
  No                              3440 (57%)          980 (33%)           394 (43%)
  Yes                             2642 (43%)          2013 (67%)          452 (49%)
  Unknown                         0 (0%)              0 (0%)              79 (8.5%)
Alcohol
  No                              3008 (49%)          1842 (62%)          436 (47%)
  Yes                             2982 (49%)          1113 (37%)          481 (52%)
  Unknown                         92 (1.5%)           38 (1.3%)           8 (0.9%)
Smoking
  No                              5217 (86%)          2536 (85%)          786 (85%)
  Yes                             773 (13%)           419 (14%)           131 (14%)
  Unknown                         92 (1.5%)           38 (1.3%)           8 (0.9%)
Hemoglobin A1c                    5.60 (5.20, 6.00)   6.00 (5.50, 6.80)   5.60 (5.20, 6.00)
  Unknown                         167                 113                 125
Diabetes
  No                              5181 (85%)          2018 (67%)          774 (84%)
  Yes                             734 (12%)           862 (29%)           26 (2.8%)
  Unknown                         167 (2.7%)          113 (3.8%)          125 (14%)
Albuminuria
  No                              4198 (69%)          1634 (55%)          151 (16%)
  Yes                             49 (0.8%)           85 (2.8%)           3 (0.3%)
  Unknown                         1835 (30%)          1274 (43%)          771 (83%)

¹ Statistics presented: n (%); median (IQR)

Problem 5

The ACC/AHA guideline may recommend initiating or intensifying medication to lower BP for adults with SBP/DBP at or above 130/80 mm Hg. This lower BP threshold has been criticized in many editorials. Using your data, assess the merit of these criticisms:

  1. Create a new dataset where neither rec_bpmeds_acc_aha nor rec_bpmeds_jnc7 has any ‘Unknown’ values.

  2. Create a new variable by uniting the rec_bpmeds_acc_aha with the rec_bpmeds_jnc7 column.

  3. You should have three categories in the new variable. Recode them as follows:

  • No_No –> Not recommended BP medication by either guideline
  • Yes_No –> Recommended BP medication by ACC/AHA only
  • Yes_Yes –> Recommended BP medication by both guidelines
  4. Fit a model to estimate the hazard ratio of being in the 2nd or 3rd group compared to the first. Use this code to fit the model:
# make sure your dataset is called cvd_model
# coxph() and Surv() come from the survival package, so load it first with library(survival)
# mdl <- coxph(Surv(cvd_time, cvd_status) ~ rec, data = cvd_model)
  5. Use the tbl_regression() function to summarize your model. Make sure to set exponentiate = TRUE so that tbl_regression() will present hazard ratios. Your table should look like this:
readr::read_rds('solutions/05_solution.rds')
Characteristic                                        HR¹    95% CI¹      p-value
Recommendation for BP lowering medications
  Not recommended BP medication by either guideline
  Recommended BP medication by ACC/AHA only           2.34   1.97, 2.78   <0.001
  Recommended BP medication by both guidelines        1.85   1.60, 2.14   <0.001

¹ HR = Hazard Ratio, CI = Confidence Interval
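
The first three steps of this problem (filter, unite, recode) can be sketched on toy data as follows. The object names toy and cvd_model_toy are hypothetical, and the sketch assumes missing values have already been coded as ‘Unknown’. Model fitting is left out because toy data of this size would not give a meaningful hazard ratio.

```r
library(dplyr)
library(tidyr)

# Toy data covering each combination that survives the filter, plus two
# rows with an 'Unknown' that should be dropped
toy <- tibble::tibble(
  rec_bpmeds_acc_aha = c("No", "Yes", "Yes", "Unknown", "Yes"),
  rec_bpmeds_jnc7    = c("No", "No", "Yes", "No", "Unknown")
)

cvd_model_toy <- toy %>%
  # step 1: drop rows where either recommendation is 'Unknown'
  filter(rec_bpmeds_acc_aha != "Unknown", rec_bpmeds_jnc7 != "Unknown") %>%
  # step 2: paste the two columns together, e.g. "Yes_No"
  unite(col = "rec", rec_bpmeds_acc_aha, rec_bpmeds_jnc7, sep = "_") %>%
  # step 3: recode, keeping the reference group as the first factor level
  mutate(rec = factor(
    rec,
    levels = c("No_No", "Yes_No", "Yes_Yes"),
    labels = c("Not recommended BP medication by either guideline",
               "Recommended BP medication by ACC/AHA only",
               "Recommended BP medication by both guidelines")
  ))
```

Only three categories appear because anyone who meets the JNC7 rule (SBP at least 140 or DBP at least 90) also meets the first ACC/AHA condition, so No_Yes cannot occur. Placing the "neither guideline" label first makes it the reference level, which is what the hazard ratios are compared against.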